Quantitative Methodology (UPF)
Install packages
Load packages
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(countrycode)
library(janitor)
library(readxl)
library(wbstats)
library(naniar)
library(haven)
library(Hmisc)“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.
Source: R for Data Science
A df is tidy if it fulfills these requirements (Wickham 2014):
But …
Votes to parties in the 2019 Spanish elections
# A tibble: 15 × 9
provincia municipio votos…¹ pp cs podem…² mas_p…³ eb pacma
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Almería Abla 710 193 47 30 0 2 3
2 Almería Abrucena 699 111 45 42 0 0 2
3 Almería Adra 10941 3126 574 645 0 16 81
4 Almería Albánchez 272 56 41 18 0 0 2
5 Almería Alboloduy 419 155 17 10 0 0 2
6 Almería Albox 4335 1261 378 247 0 2 28
7 Almería Alcolea 509 92 39 43 0 1 11
8 Almería Alcóntar 371 125 20 15 0 2 3
9 Almería Alcudia de Monteag… 106 47 5 6 0 0 0
10 Almería Alhabia 409 95 35 23 0 0 1
11 Almería Alhama de Almería 1984 402 103 182 0 0 18
12 Almería Alicún 150 38 11 7 0 0 3
13 Almería Almería 92383 24495 7848 9699 0 204 964
14 Almería Almócita 103 10 6 16 0 1 0
15 Almería Alsodux 81 16 1 9 0 3 3
# … with abbreviated variable names ¹votos_validos, ²podemos_iu, ³mas_pais
But …
Minor contracts granted in Barcelona
# A tibble: 15 × 5
ens contracte data proveidor import
<chr> <chr> <date> <chr> <dbl>
1 CONSORCIS Serveis NA Marc Redorta 1646.
2 CONSORCIS Serveis NA JESUS SANZ LOPEZ 360
3 CONSORCIS Serveis NA JOAN BOSCH MUNT… 690.
4 CONSORCIS Serveis NA ALLIANZ, Compañ… 456.
5 FUNDACIONS I ASSOCIACIONS <NA> NA Vopi4Elecnor 1891.
6 FUNDACIONS I ASSOCIACIONS Serveis 2918-11-19 Jaume Badosa 1754.
7 ORGANISMES AUTÒNOMS LOCALS Serveis NA WHADS MEDIA STU… 1392.
8 ORGANISMES AUTÒNOMS LOCALS Serveis NA AJUNTAMENT BARC… 4655.
9 ORGANISMES AUTÒNOMS LOCALS Subministrament NA INADHOC HABITAT 9859.
10 ORGANISMES AUTÒNOMS LOCALS Serveis NA DOT CONSULTING 18090.
11 ORGANISMES AUTÒNOMS LOCALS Subministrament NA SAYTEL 15562.
12 ORGANISMES AUTÒNOMS LOCALS Subministrament NA ID GRUP 1197.
13 ORGANISMES AUTÒNOMS LOCALS Subministrament NA ID GRUP 2117.
14 ORGANISMES AUTÒNOMS LOCALS Serveis NA CODINA, ESTHER 1600
15 ORGANISMES AUTÒNOMS LOCALS Serveis NA ABELEIRA, ANTON… 1200
But …
Price of rents
# A tibble: 15 × 3
nom_barri preu preu_m2
<chr> <dbl> <dbl>
1 el Raval 590. 10.8
2 el Barri Gòtic 713. 10.6
3 la Barceloneta 541. 14.4
4 Sant Pere, Santa Caterina i la Ribera 673. 11.0
5 el Fort Pienc 736. 10.4
6 la Sagrada Família 673. 10.6
7 la Dreta de l'Eixample 921. 9.84
8 l'Antiga Esquerra de l'Eixample 828. 10.4
9 la Nova Esquerra de l'Eixample 716. 10.3
10 Sant Antoni 693. 9.77
11 el Poble Sec 568 10.2
12 la Marina del Prat Vermell NA NA
13 la Marina de Port 554. 8.34
14 la Font de la Guatlla 632. 10.8
15 Hostafrancs 581. 10.4
Income per capita
# A tibble: 15 × 3
nom_barri sc import_euros
<chr> <dbl> <dbl>
1 el Raval 1 27503
2 el Raval 2 21913
3 el Raval 3 24220
4 el Raval 4 27405
5 el Raval 5 23014
6 el Raval 6 25581
7 el Raval 7 25959
8 el Raval 8 21900
9 el Raval 9 24849
10 el Raval 10 23306
11 el Raval 11 21571
12 el Raval 12 25592
13 el Raval 13 27908
14 el Raval 14 26078
15 el Raval 15 26092
Often, data is not tidy because it’s:
Number of TB cases documented by WHO in Afghanistan, Brazil, and China between 1999 & 2000 (cases & population) (example from Wickham and Girlich (2022)).
Table A
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Table B
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Table C
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Table D
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
We change the rows and columns of the dataframe keeping the same information.
pivot_longer(df, cols, names_to, values_to)pivot_wider(df, names_from, values_from)separate(df, col, into, sep)unite(df, col, ..., sep)Three reasons, among others:
Zero vs. NA (Not Available)
Special categories: People do not answer.
# A tibble: 1,599 × 4
age language situation_country situation_personal
<chr> <chr> <chr> <chr>
1 26 Igbo Very Bad Fairly Good
2 25 Other Very Bad Fairly Bad
3 35 Hausa Fairly Good Very good
4 79 Other Neither good nor bad Neither good nor bad
5 19 English Very Bad Fairly Good
6 34 Igbo Don't know Don't know
7 30 Pidgin English Very Bad Very Bad
8 32 Hausa Fairly Good Very good
9 50 Other Neither good nor bad Neither good nor bad
10 18 English Very Bad Neither good nor bad
# … with 1,589 more rows
Special categories: Hard to qualify.
# A tibble: 28 × 6
ccode scode country year polity polity2
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 560 SAF South Africa 1991 5 5
2 560 SAF South Africa 1992 -88 6
3 560 SAF South Africa 1993 -88 8
4 560 SAF South Africa 1994 9 9
5 560 SAF South Africa 1995 9 9
6 560 SAF South Africa 1996 9 9
7 560 SAF South Africa 1997 9 9
8 560 SAF South Africa 1998 9 9
9 560 SAF South Africa 1999 9 9
10 560 SAF South Africa 2000 9 9
# … with 18 more rows
Not known (NA) or strange/extreme value.
# A tibble: 7 × 8
any trim codi_dist districte codi_barri nom_barri preu preu_m2
<dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 2014 1 8 Nou Barris 47 Can Peguera 242. 6.03
2 2014 2 9 Sant Andreu 58 Baró de Viver 231. 4.03
3 2014 3 8 Nou Barris 47 Can Peguera 145. 3.54
4 2015 3 9 Sant Andreu 58 Baró de Viver 187 4
5 2016 1 9 Sant Andreu 58 Baró de Viver 142. 3.18
6 2016 3 8 Nou Barris 54 Torre Baró 0 NA
7 2016 3 9 Sant Andreu 58 Baró de Viver 256. 3.85
Data join typically produces NA.
Some functions:
summary() for exploration.is.na() is useful when filtering.na.rm = TRUE argument in summary functions.replace_na() assigns a value to NAs.naniar::vis_miss() (be careful!!)The real problem is when the reason why data is missing is related to what we are studying:
full_join()left_join()right_join()inner_join()Examples:
Quantitative Methodology (UPF)